The data we have chosen to look at is housing prices in California.
This data comes from Kaggle (https://www.kaggle.com/datasets/fedesoriano/california-housing-prices-data-extra-features)
and contains variables that could go into predicting the price of a house in
California. As people who currently rent (one of us in
California), we hope to one day purchase a home, and
understanding this model could help us identify the important factors
in predicting price and judge whether a home we intend to buy is a
good deal. Our model will predict
Median House Value, using possible predictors
such as Median Income and Total Rooms.
First we need to load the data and prepare some of the columns. To
augment the data, we take the predictors
Distance to LA,
Distance to San Diego,
Distance to San Jose, and
Distance to San Francisco and convert them into a single
factor column that records which of the four cities is nearest. This segments the data into
regions of California and lets us see whether there is any relevance in being closer to
one city vs. another.
library(readr)
housing_data = read_csv("California_Houses.csv")
## Rows: 20640 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (14): Median_House_Value, Median_Income, Median_Age, Tot_Rooms, Tot_Bedr...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Label each row with whichever of the four cities is closest.
nearest_city = rep("", nrow(housing_data))
nearest_city_options = c("LA", "San Diego", "San Jose", "San Francisco")
distance_cols = c("Distance_to_LA", "Distance_to_SanDiego", "Distance_to_SanJose", "Distance_to_SanFrancisco")
for (i in seq_len(nrow(housing_data))) {
  # The order of distance_cols matches the order of nearest_city_options.
  distances = housing_data[i, distance_cols]
  nearest_city[i] = nearest_city_options[which.min(distances)]
}
housing_data$nearest_city = as.factor(nearest_city)
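The row-by-row loop above can be slow on roughly 20,000 rows. As a minimal sketch of a vectorized alternative (not the method used in this report), base R's `max.col()` can find the smallest distance in each row when the distances are negated. The toy data frame and the names `toy`, `city_labels`, `nearest_idx`, and `nearest_city_vec` are made up for illustration. Only the column names come from the data set.

```r
# Toy distances (made-up values) standing in for the four Distance_to_* columns.
toy = data.frame(
  Distance_to_LA           = c(10, 500, 300, 600),
  Distance_to_SanDiego     = c(150, 5, 450, 700),
  Distance_to_SanJose      = c(400, 600, 20, 60),
  Distance_to_SanFrancisco = c(450, 650, 70, 15)
)
city_labels = c("LA", "San Diego", "San Jose", "San Francisco")

# max.col() returns the column index of each row's maximum; negating the
# distances turns that into the column index of each row's minimum.
nearest_idx = max.col(-as.matrix(toy), ties.method = "first")
nearest_city_vec = factor(city_labels[nearest_idx], levels = city_labels)

# Row-by-row version, mirroring the loop above, used as a cross-check.
nearest_city_loop = character(nrow(toy))
for (i in seq_len(nrow(toy))) {
  nearest_city_loop[i] = city_labels[which.min(unlist(toy[i, ]))]
}
stopifnot(identical(as.character(nearest_city_vec), nearest_city_loop))
```

Setting `ties.method = "first"` matters here: the default for `max.col()` breaks ties at random, whereas `which.min()` always takes the first minimum.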
Below is a peek at the data, with the newly added column:
head(housing_data)
## # A tibble: 6 × 15
## Median_House_Value Median_Income Median_Age Tot_Rooms Tot_Bedrooms Population
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 452600 8.33 41 880 129 322
## 2 358500 8.30 21 7099 1106 2401
## 3 352100 7.26 52 1467 190 496
## 4 341300 5.64 52 1274 235 558
## 5 342200 3.85 52 1627 280 565
## 6 269700 4.04 52 919 213 413
## # … with 9 more variables: Households <dbl>, Latitude <dbl>, Longitude <dbl>,
## # Distance_to_coast <dbl>, Distance_to_LA <dbl>, Distance_to_SanDiego <dbl>,
## # Distance_to_SanJose <dbl>, Distance_to_SanFrancisco <dbl>,
## # nearest_city <fct>
set.seed(420)
housing_data_idx = sample(nrow(housing_data), size = trunc(0.80 * nrow(housing_data)))
housing_data_trn = housing_data[housing_data_idx, ]
housing_data_tst = housing_data[-housing_data_idx, ]
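A quick sanity check on an 80/20 split like the one above is to confirm the two pieces have the expected sizes and share no rows. The sketch below uses a stand-in data frame with the same row count that `read_csv()` reported (20640) rather than the real file. The names `toy`, `idx`, `trn`, and `tst` are illustrative only.

```r
set.seed(420)
n = 20640                                  # row count reported by read_csv()
toy = data.frame(id = seq_len(n))          # stand-in for housing_data

# Same pattern as above: sample 80% of the row indices for training.
idx = sample(n, size = trunc(0.80 * n))
trn = toy[idx, , drop = FALSE]
tst = toy[-idx, , drop = FALSE]

# The split should be exhaustive and non-overlapping.
stopifnot(nrow(trn) == 16512)
stopifnot(nrow(tst) == 4128)
stopifnot(nrow(trn) + nrow(tst) == n)
stopifnot(length(intersect(trn$id, tst$id)) == 0)
```

Because `sample()` draws without replacement by default, negative indexing with `-idx` leaves exactly the rows that were not selected for training.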
With the data loaded, prepped, and split 80/20 into training and test sets, we want to start building the model. Before we do that, we check pairwise plots of the response against the other variables to see if there are any predictors we need to transform.
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(housing_data_trn,
        columns = c(1, 2:5),             # Response plus columns 2-5
        aes(color = nearest_city,        # Color by group (cat. variable)
            alpha = 0.5))
ggpairs(housing_data_trn,
        columns = c(1, 6:9),             # Response plus columns 6-9
        aes(color = nearest_city,        # Color by group (cat. variable)
            alpha = 0.5))
ggpairs(housing_data_trn,
        columns = c(1, 10:13),           # Response plus columns 10-13
        aes(color = nearest_city,        # Color by group (cat. variable)
            alpha = 0.5))
ggpairs(housing_data_trn,
        columns = c(1, 14:15),           # Response plus columns 14-15
        aes(color = nearest_city,        # Color by group (cat. variable)
            alpha = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.